I'm Julia.
%pylab inline
import pandas as pd
pd.set_option('display.mpl_style', 'default')
figsize(15, 6)
pd.set_option('display.line_width', 4000)
pd.set_option('display.max_columns', 100)
Goal: know how to use pandas to answer some specific questions about a dataset.
Roadmap:
sudo apt-get install ipython-notebook
pip install ipython tornado pyzmq
or install Anaconda from http://store.continuum.io (what I do)
You can start IPython notebook by running
ipython notebook --pylab inline
# Download and read the data
!wget "http://bit.ly/311-data-tar-gz" -O 311-data.tar.gz
!wget "https://raw2.github.com/jvns/talks/master/pyladiesfeb2014/tiny.csv" -O tiny.csv
!tar -xzf "311-data.tar.gz" # extract the archive we just downloaded
orig_data = pd.read_csv('./311-service-requests.csv', nrows=100000, parse_dates=['Created Date'])
plot(orig_data['Longitude'], orig_data['Latitude'], '.', color="purple")
complaints = orig_data[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints.set_index('Created Date').sort_index().resample('H', how=len).plot()
orig_data['Complaint Type'].value_counts()[:20].plot(kind='bar')
This is what lets you manipulate data easily -- the dataframe is basically the whole reason for pandas. It's a powerful concept from the statistical computing language R.
If you don't know R, you can think of it like a database table (it has rows and columns), or like a table of numbers.
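If you'd rather see a dataframe without reading a CSV at all, here's a tiny made-up one built from a plain dict (the column names just mimic tiny.csv — the values are invented):

```python
import pandas as pd

# A dataframe is a table: named columns, one value per row.
people = pd.DataFrame({
    'name': ['ada', 'grace', 'katherine'],
    'age': [36, 45, 33],
    'height': [160, 170, 165],
})
print(people)
print(people.shape)  # (number of rows, number of columns)
```

Each column behaves like a numpy array, which is what makes the operations below fast.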
people = pd.read_csv('tiny.csv')
people
This is like a SQL database table, or an R dataframe. There are 3 columns, called 'name', 'age', and 'height', and 6 rows.
I want you to know about this because you almost always only want a subset of the data you're working on. We are going to look at a CSV with 40 columns and 1,000,000 rows.
# Load the first 5 rows of our CSV
small_requests = pd.read_csv('./311-service-requests.csv', nrows=5)
# How to get a column
small_requests['Complaint Type']
# How to get a subset of the columns
small_requests[['Complaint Type', 'Created Date']]
# How to get 3 rows
small_requests[:3]
small_requests['Agency Name'][:3]
small_requests[:3]['Agency Name']
small_requests['Complaint Type']
# This is like our numpy example from before
small_requests['Complaint Type'] == 'Noise - Street/Sidewalk'
That's numpy in action! Using `==` on a column of a dataframe gives us a Series of True and False values.
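The same trick works on any toy Series, so here's a self-contained sketch of boolean masking with made-up complaint types:

```python
import pandas as pd

s = pd.Series(['Noise', 'Rats', 'Noise', 'Heat'])
mask = s == 'Noise'      # elementwise comparison -> boolean Series
print(mask.tolist())     # [True, False, True, False]
print(s[mask].tolist())  # keeps only the rows where mask is True
```

Indexing with a boolean Series is exactly what the `noise_complaints` line below does, just on a bigger table.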
# This is like our numpy example earlier
noise_complaints = small_requests[small_requests['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints
Every DataFrame has an index: an integer, a date, or some other label associated with each row.
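To make "index" concrete, here's a tiny sketch (the names are made up): by default the index is just 0, 1, 2, ..., but you can swap any column in as the index and then look rows up by label.

```python
import pandas as pd

df = pd.DataFrame({'name': ['ada', 'grace'], 'age': [36, 45]})
print(df.index.tolist())  # default integer index: [0, 1]

by_name = df.set_index('name')
print(by_name.index.tolist())       # now the names are the index
print(by_name.loc['grace', 'age'])  # look a row up by its index label
```

`set_index` is what we'll use below to turn 'Created Date' into a date index.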
# How to get a specific row
small_requests.ix[0]
# How not to get a row
small_requests[0]
# Your code here
# We ran this at the beginning, so we don't have to run it again. Just here as a reminder.
#orig_data = pd.read_csv('./311-service-requests.csv', nrows=100000, parse_dates=['Created Date'])
complaints = orig_data[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints.set_index('Created Date').sort_index().resample('H', how=len).plot()
noise_complaints[:3]
noise_complaints = noise_complaints.set_index('Created Date')
noise_complaints[:3]
Pandas is awesome for datetime-index stuff. It was built for dealing with financial data, which is ALL TIME SERIES.
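Here's a minimal sketch of what that resample is doing, on three made-up timestamps (note: newer pandas spells `resample('H', how=len)` as `resample('h').size()`, which is what this sketch uses):

```python
import pandas as pd

times = pd.to_datetime([
    '2014-02-01 00:10', '2014-02-01 00:50',  # two events in hour 0
    '2014-02-01 02:30',                      # one event in hour 2
])
events = pd.Series(1, index=times)

# Bucket the index into hours and count the events in each bucket.
# Empty hours (like 01:00) show up as 0, which is what makes the plot honest.
per_hour = events.resample('h').size()
print(per_hour)
```

The hourly-noise-complaints plot is this exact operation, just with 100,000 rows.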
noise_complaints = noise_complaints.sort_index()
noise_complaints[:3]
noise_complaints.resample('H', how=len)[:3]
noise_complaints.resample('H', how=len).plot()
complaints = orig_data[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints.set_index('Created Date').sort_index().resample('H', how=len).plot()
orig_data['Complaint Type'].value_counts()
orig_data['Complaint Type'].value_counts()[:20].plot(kind='bar')
# Your code here.
complaints = orig_data[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints = noise_complaints.set_index("Created Date")
noise_complaints['weekday'] = noise_complaints.index.weekday
noise_complaints[:3]
# Count the complaints by weekday
counts_by_weekday = noise_complaints.groupby('weekday').aggregate(len)
counts_by_weekday
# change the index to be actual day names (pandas weekday numbering: 0 = Monday)
counts_by_weekday.index = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
counts_by_weekday.plot(kind='bar')
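A quick sanity check on that weekday numbering, since it's easy to get wrong: pandas counts from 0 = Monday, not Sunday.

```python
import pandas as pd

d = pd.to_datetime('2014-02-03')  # February 3, 2014 was a Monday
print(d.weekday())                # Monday -> 0
```

If your bar chart's labels look shifted by a day, this off-by-one is usually why.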
# Your code here
# We need to get rid of the NA values for this to work
street_names = orig_data['Street Name'].fillna('')
manhattan_streets = street_names[street_names.str.contains("MANHATTAN")]
manhattan_streets
manhattan_streets.value_counts()
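Here's why the `fillna('')` matters, sketched on a toy Series with a missing value: `.str.contains` propagates NaN for missing entries, and NaN can't be used in a boolean mask, so we blank the NaNs out first.

```python
import pandas as pd
import numpy as np

streets = pd.Series(['MANHATTAN AVE', 'BROADWAY', np.nan])

# fillna('') turns the missing value into an empty string,
# so .str.contains returns clean True/False for every row
mask = streets.fillna('').str.contains('MANHATTAN')
print(mask.tolist())  # [True, False, False]
```

(An alternative is `streets.str.contains('MANHATTAN', na=False)`, which treats missing values as non-matches directly.)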
# Our current latitude and longitude
our_lat, our_long = 40.714151, -74.00878
distance_from_us = (orig_data['Longitude'] - our_long)**2 + (orig_data['Latitude'] - our_lat)**2
distance_from_us.hist()  # it's already a Series, no need to wrap it
close_complaints = orig_data[distance_from_us < 0.00005]
close_complaints['Complaint Type'].value_counts()[:20].plot(kind='bar')